Goto

Collaborating Authors

A  Corrections to the main paper
B  Problem setup

Neural Information Processing Systems

In the course of preparing the supplementary materials we identified the following two mistakes. For the convenience of the reader we provide the full, corrected table below, in which C is an appropriately chosen constant. [Corrected table comparing the bounds of Frei et al. (2022), Xu & Gu (2023), and Theorems 3.1, 3.6, and 3.8 in terms of n, m, d, k, γ, and δ; the table layout is not recoverable from the extracted text.] The same mistake also means that the sentence starting on line 188 ("Comparing …") must be amended.

In order to provide a convenient reference for the reader, we summarize our notation as follows. As such we typically resort to using a generically large enough constant C. For the reader's convenience we recap the data model studied in this work. We assume test data are drawn mutually i.i.d. In regard to the initialization of the network weights, for convenience we assume each neuron's … To this end, we introduce the following notation, where p ∈ {−1, 1}.

Since P((B < κT) ∩ (T > 0) | w, v > 0) ≥ 1 − P(T = 0 | w, v > 0) − P(B ≥ κT | w, v > 0), it suffices to upper bound the two probabilities on the right-hand side. Using a variant of Hoeffding's bound for sampling without replacement (see Proposition …), and based on Lemma B.2, the following lemma bounds the probability that … For the counting functions, in particular, we write P₊(i, l) + P₋(i, l) = P(i, i) = 1/2 and hence conclude p + q = 1/2. As a result, observe by the data model, described in Section B.2, that … We will often make use of the following similar but more pessimistic bounds on the activations.


A Generalized Adaptive Joint Learning Framework for High-Dimensional Time-Varying Models

Chen, Baolin, Ran, Mengfei

arXiv.org Machine Learning

In modern biomedical and econometric studies, longitudinal processes are often characterized by complex time-varying associations and abrupt regime shifts that are shared across correlated outcomes. Standard functional data analysis (FDA) methods, which prioritize smoothness, often fail to capture these dynamic structural features, particularly in high-dimensional settings. This article introduces Adaptive Joint Learning (AJL), a hierarchical regularization framework designed to integrate functional variable selection with structural changepoint detection in multivariate time-varying coefficient models. Unlike standard simultaneous estimation approaches, we propose a theoretically grounded two-stage screening-and-refinement procedure. This framework first synergizes adaptive group-wise penalization with sure screening principles to robustly identify active predictors, followed by a refined fused regularization step that effectively borrows strength across multiple outcomes to detect local regime shifts. We provide a rigorous theoretical analysis of the estimator in the ultra-high-dimensional regime (p >> n). Crucially, we establish the sure screening consistency of the first stage, which serves as the foundation for proving that the refined estimator achieves the oracle property: performing as well as if the true active set and changepoint locations were known a priori. A key theoretical contribution is the explicit handling of approximation bias via undersmoothing conditions to ensure valid asymptotic inference. The proposed method is validated through comprehensive simulations and an application to Sleep-EDF data, revealing novel dynamic patterns in physiological states.
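As a toy illustration of the two-stage screening-and-refinement idea described above (not the AJL estimator itself), the sketch below screens predictors by marginal correlation, in the spirit of sure screening, and then flags the largest jump in a windowed coefficient path as a candidate changepoint. The data-generating process, window size, and screening size are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50                 # n time points, p candidate predictors
t = np.linspace(0, 1, n)

# Toy model: only predictor 0 is active; its coefficient jumps at t = 0.5
beta = np.where(t < 0.5, 1.0, 3.0)
X = rng.normal(size=(n, p))
y = beta * X[:, 0] + 0.1 * rng.normal(size=n)

# Stage 1 (screening): rank predictors by marginal correlation, keep the top k
k = 5
scores = np.abs(X.T @ y) / n
active = np.argsort(scores)[::-1][:k]          # predictor 0 should survive

# Stage 2 (refinement): windowed least-squares coefficient path for the
# screened predictor; the largest jump marks a candidate changepoint
w = 20
path = np.array([
    np.dot(X[i:i + w, 0], y[i:i + w]) / np.dot(X[i:i + w, 0], X[i:i + w, 0])
    for i in range(n - w)
])
jump = int(np.argmax(np.abs(np.diff(path))))   # a window start straddling t = 0.5
```

In this toy run the detected window start lands near index 100 (the true jump at t = 0.5), and the irrelevant predictors are screened out before the changepoint search, which is the ordering the two-stage procedure relies on.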


Sharp Structure-Agnostic Lower Bounds for General Functional Estimation

Jin, Jikai, Syrgkanis, Vasilis

arXiv.org Machine Learning

The design of efficient nonparametric estimators has long been a central problem in statistics, machine learning, and decision making. Classical optimal procedures often rely on strong structural assumptions, which can be misspecified in practice and complicate deployment. This limitation has sparked growing interest in structure-agnostic approaches -- methods that debias black-box nuisance estimates without imposing structural priors. Understanding the fundamental limits of these methods is therefore crucial. This paper provides a systematic investigation of the optimal error rates achievable by structure-agnostic estimators. We first show that, for estimating the average treatment effect (ATE), a central parameter in causal inference, doubly robust learning attains optimal structure-agnostic error rates. We then extend our analysis to a general class of functionals that depend on unknown nuisance functions and establish the structure-agnostic optimality of debiased/double machine learning (DML). We distinguish two regimes -- one where double robustness is attainable and one where it is not -- leading to different optimal rates for first-order debiasing, and show that DML is optimal in both regimes. Finally, we instantiate our general lower bounds by deriving explicit optimal rates that recover existing results and extend to additional estimands of interest. Our results provide theoretical validation for widely used first-order debiasing methods and guidance for practitioners seeking optimal approaches in the absence of structural assumptions. This paper generalizes and subsumes the ATE lower bound established in Jin and Syrgkanis (2024) by the same authors.
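The doubly robust (AIPW) score for the ATE, the standard first-order debiasing method the abstract refers to, can be sketched numerically. The data-generating process and the deliberately misspecified outcome model below are our own toy choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))                 # true propensity score
T = rng.binomial(1, e)
tau = 2.0                                # true ATE
Y = tau * T + X + rng.normal(size=n)     # outcome: m_t(x) = tau * t + x

# Black-box nuisance estimates: propensity correct, outcome models
# deliberately WRONG (set to zero) to illustrate double robustness
e_hat = e
m1_hat = np.zeros(n)
m0_hat = np.zeros(n)

# AIPW / doubly robust score: consistent if EITHER nuisance is correct
psi = (m1_hat - m0_hat
       + T * (Y - m1_hat) / e_hat
       - (1 - T) * (Y - m0_hat) / (1 - e_hat))
ate_dr = psi.mean()                      # close to the true ATE of 2.0
```

Because the propensity is correct, the estimate stays near 2.0 despite the broken outcome regressions; symmetrically, a correct outcome model tolerates a wrong propensity, which is exactly the first-order robustness whose limits the paper characterizes.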


Comparing Two Proxy Methods for Causal Identification

Guo, Helen, Ogburn, Elizabeth L., Shpitser, Ilya

arXiv.org Machine Learning

Identifying causal effects in the presence of unmeasured variables is a fundamental challenge in causal inference, for which proxy variable methods have emerged as a powerful solution. We contrast two major approaches in this framework: (1) bridge equation methods, which leverage solutions to integral equations to recover causal targets, and (2) array decomposition methods, which recover latent factors composing counterfactual quantities by exploiting unique determination of eigenspaces. We compare the model restrictions underlying these two approaches and provide insight into implications of the underlying assumptions, clarifying the scope of applicability for each method.


Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning

Liu, Bing, Kong, Boao, Lu, Limin, Yuan, Kun, Zhao, Chengcheng

arXiv.org Artificial Intelligence

Decentralized learning often involves a weighted global loss with heterogeneous node weights $λ$. We revisit two natural strategies for incorporating these weights: (i) embedding them into the local losses to retain a uniform weight (and thus a doubly stochastic matrix), and (ii) keeping the original losses while employing a $λ$-induced row-stochastic matrix. Although prior work shows that both strategies yield the same expected descent direction for the global loss, it remains unclear whether the Euclidean-space guarantees are tight and what fundamentally differentiates their behaviors. To clarify this, we develop a weighted Hilbert-space framework $L^2(λ;\mathbb{R}^d)$ and obtain convergence rates that are strictly tighter than those from Euclidean analysis. In this geometry, the row-stochastic matrix becomes self-adjoint whereas the doubly stochastic one does not, creating additional penalty terms that amplify consensus error, thereby slowing convergence. Consequently, the difference in convergence arises not only from spectral gaps but also from these penalty terms. We then derive sufficient conditions under which the row-stochastic design converges faster even with a smaller spectral gap. Finally, by using a Rayleigh-quotient and Loewner-order eigenvalue comparison, we further obtain topology conditions that guarantee this advantage and yield practical topology-design guidelines.
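A minimal numerical sketch of strategy (ii): we construct a λ-induced row-stochastic matrix as W = I − ε·diag(λ)⁻¹L for a graph Laplacian L (our own simple construction, not necessarily the paper's), so that λ is its left Perron vector, and run decentralized gradient descent on the original quadratic losses.

```python
import numpy as np

# Four nodes on a ring with heterogeneous node weights lambda
lam = np.array([1.0, 2.0, 3.0, 4.0])
pi = lam / lam.sum()

# Graph Laplacian of the 4-cycle
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A

# lambda-induced row-stochastic matrix W = I - eps * diag(lam)^{-1} L:
# W @ 1 = 1 (row-stochastic) and lam @ W = lam (left Perron vector),
# but W is NOT doubly stochastic
eps = 0.1
W = np.eye(4) - eps * L / lam[:, None]
assert np.allclose(W.sum(axis=1), 1.0) and np.allclose(lam @ W, lam)

# Decentralized gradient descent keeping the ORIGINAL losses
# f_i(x) = 0.5 * (x - a_i)^2: nodes reach consensus at the
# lambda-weighted optimum pi @ a = 2.0, not the uniform average 1.5
a = np.array([0.0, 1.0, 2.0, 3.0])
x = a.copy()
alpha = 5e-4
for _ in range(40_000):
    x = W @ (x - alpha * (x - a))
```

The fixed point sits at the λ-weighted minimizer because the left Perron vector of the mixing matrix, not the uniform vector, determines which global loss the iteration implicitly optimizes.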



A Guide Through the Zoo of Biased SGD

Neural Information Processing Systems

We also provide examples where biased estimators outperform their unbiased counterparts or where unbiased versions are simply not available. Finally, we demonstrate the effectiveness of our framework through experimental results that validate our theoretical findings.
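One classic instance of a biased estimator beating an unbiased one (our own illustration; the paper's examples may differ) is top-k versus rescaled rand-k gradient compression: top-k is biased but minimizes the approximation error, while rand-k is unbiased yet pays a large variance penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 10
g = rng.standard_normal(d) * np.logspace(0, -3, d)  # gradient with decaying coords

def top_k(v, k):
    """Biased sketch: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng):
    """Unbiased sketch: keep k random coordinates, rescaled by d/k so E[C(v)] = v."""
    out = np.zeros_like(v)
    idx = rng.choice(len(v), size=k, replace=False)
    out[idx] = v[idx] * (len(v) / k)
    return out

err_top = np.linalg.norm(g - top_k(g, k)) ** 2
err_rand = np.mean([np.linalg.norm(g - rand_k(g, k, rng)) ** 2
                    for _ in range(2000)])
# err_top is far below err_rand: the biased estimator wins in mean-squared error
```

For rand-k the mean-squared error works out to (d − k)/k · ||g||² regardless of how the mass is distributed, whereas top-k only pays for the discarded small coordinates, which is why biased compressors dominate on gradients with decaying spectra.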


R1, R2, R4: Suggest more extensive analysis on Assumption 2 and the normalization step in Algorithm 1

Neural Information Processing Systems

We would like to thank the reviewers for their insightful feedback. In the following, we address their key concerns. Following the reviewers' suggestions, we will add a more thorough analysis in the final paper. Its advantages and applications are then limited. Mixup was introduced in VPU as a regularizer to mitigate the overfitting problem (Table 4 and Lines 100-105, 376-384).
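For reference, the regularizer mentioned above is standard mixup, which trains on convex combinations of input pairs and their labels. The sketch below is a generic minimal version (the Beta parameter and toy inputs are our own choices, not VPU's configuration).

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.3, rng=None):
    """Standard mixup (Zhang et al., 2018): draw lam ~ Beta(alpha, alpha) and
    return convex combinations of the inputs and of the labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.zeros(2), 0.0, np.ones(2), 1.0,
                     rng=np.random.default_rng(0))
# the mixed input coordinates and the mixed label share the same weight lam
```

Training on such interpolated pairs discourages the network from fitting sharp decision boundaries around individual samples, which is why it helps against overfitting in positive-unlabeled settings.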


Appendix Table of Contents

Neural Information Processing Systems

We prove the result for each of the three possible cases of the loss function. By Lemma A.3, for (x, y) ∈ X × Y, we have … Using Lemma A.2, we have … The ASO formulation above motivated the authors of [59] … Note that when Θ is a full-rank matrix, this decomposition is unique. Several personalized FL formulations, e.g., …

D.1 Client-Server Algorithm. Alg. 2 is a detailed version of Alg. 1 (FedEM), with local SGD used as the local solver. Alg. 3 gives our general algorithm for federated surrogate optimization, from which Alg. 2 is derived. (Algorithm 2: FedEM, Federated Expectation-Maximization; Input: Data S …) Alg. 5 gives our general fully decentralized algorithm for federated surrogate optimization. As mentioned in Section 3.3, the convergence of decentralized optimization schemes requires certain assumptions; in our paper, we consider the following general assumption. We provide below the rigorous statement of Theorem 3.3, which was informally presented in …, whose iterates satisfy the following inequalities after a large enough number of iterations. In particular, we provide the assumptions under which Alg. 3 and Alg. 5 converge.
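The client-server pattern that Alg. 2 instantiates, local SGD on each client followed by server-side aggregation, can be sketched in miniature. The quadratic local losses and plain averaging below are our own simplifications for illustration, not FedEM itself.

```python
import numpy as np

# Each client i holds a quadratic loss f_i(w) = 0.5 * ||w - c_i||^2
centers = [np.array([0.0]), np.array([2.0]), np.array([4.0])]  # client optima

def local_sgd(w, c, steps=10, lr=0.1):
    """Local solver: a few gradient steps on the client's own loss."""
    for _ in range(steps):
        w = w - lr * (w - c)      # exact gradient of the local quadratic
    return w

w = np.array([10.0])              # server model
for _ in range(50):               # communication rounds
    # broadcast w, run local SGD on every client, then average the updates
    w = np.mean([local_sgd(w.copy(), c) for c in centers], axis=0)
# w converges to the average of the client optima, here 2.0
```

Each round contracts the distance to the average optimum by a fixed factor, so a modest number of communication rounds suffices; the convergence analyses referenced above replace this toy contraction argument with assumptions on the surrogate losses and mixing.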